12 research outputs found
Toy Models of Superposition
Neural networks often pack many unrelated concepts into a single neuron - a
puzzling phenomenon known as 'polysemanticity', which makes interpretability
much more challenging. This paper provides a toy model where polysemanticity
can be fully understood, arising as a result of models storing additional
sparse features in "superposition." We demonstrate the existence of a phase
change, a surprising connection to the geometry of uniform polytopes, and
evidence of a link to adversarial examples. We also discuss potential
implications for mechanistic interpretability.
Also available at https://transformer-circuits.pub/2022/toy_model/index.htm
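A minimal sketch of the kind of toy model studied here, assuming PyTorch; the sizes, sparsity level, and variable names below are illustrative choices of ours, not the authors' exact configuration:

import torch

n_features, n_hidden, batch = 20, 5, 1024
sparsity = 0.05  # each feature is active with this probability

# Tied-weight toy model: compress n_features into n_hidden < n_features
# dimensions, then reconstruct through a ReLU readout.
W = (0.1 * torch.randn(n_hidden, n_features)).requires_grad_()
b = torch.zeros(n_features, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5000):
    # Synthetic sparse features: mostly zero, uniform in [0, 1) when active.
    active = (torch.rand(batch, n_features) < sparsity).float()
    x = torch.rand(batch, n_features) * active
    x_hat = torch.relu(x @ W.T @ W + b)
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# With sparse inputs, more than n_hidden features typically end up with
# non-zero embeddings stored as non-orthogonal directions; the off-diagonal
# structure of W^T W is where superposition becomes visible.
print((W.T @ W).detach())

Sweeping the sparsity knob is one way to see the phase change the abstract mentions: with dense features the model tends to dedicate its n_hidden dimensions to a few features, while high sparsity pushes it into superposition.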
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
We describe our early efforts to red team language models in order to
simultaneously discover, measure, and attempt to reduce their potentially
harmful outputs. We make three main contributions. First, we investigate
scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B
parameters) and 4 model types: a plain language model (LM); an LM prompted to
be helpful, honest, and harmless; an LM with rejection sampling; and a model
trained to be helpful and harmless using reinforcement learning from human
feedback (RLHF). We find that the RLHF models are increasingly difficult to red
team as they scale, and we find a flat trend with scale for the other model
types. Second, we release our dataset of 38,961 red team attacks for others to
analyze and learn from. We provide our own analysis of the data and find a
variety of harmful outputs, which range from offensive language to more subtly
harmful non-violent unethical outputs. Third, we exhaustively describe our
instructions, processes, statistical methodologies, and uncertainty about red
teaming. We hope that this transparency accelerates our ability to work
together as a community in order to develop shared norms, practices, and
technical standards for how to red team language models.
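As a concrete reading of the "LM with rejection sampling" variant above, a hedged sketch in Python; generate and harmlessness_score are hypothetical stand-ins for a sampler and a preference model, not interfaces from the paper:

from typing import Callable

def rejection_sample(prompt: str,
                     generate: Callable[[str], str],
                     harmlessness_score: Callable[[str, str], float],
                     k: int = 16) -> str:
    """Draw k completions and keep the one judged most harmless."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda c: harmlessness_score(prompt, c))

Under this scheme a red team attack succeeds only if all k samples are harmful or the preference model misranks them, which gives one intuition for why such models can be harder to red team.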
Language Models (Mostly) Know What They Know
We study whether language models can evaluate the validity of their own
claims and predict which questions they will be able to answer correctly. We
first show that larger models are well-calibrated on diverse multiple choice
and true/false questions when they are provided in the right format. Thus we
can approach self-evaluation on open-ended sampling tasks by asking models to
first propose answers, and then to evaluate the probability "P(True)" that
their answers are correct. We find encouraging performance, calibration, and
scaling for P(True) on a diverse array of tasks. Performance at self-evaluation
further improves when we allow models to consider many of their own samples
before predicting the validity of one specific possibility. Next, we
investigate whether models can be trained to predict "P(IK)", the probability
that "I know" the answer to a question, without reference to any particular
proposed answer. Models perform well at predicting P(IK) and partially
generalize across tasks, though they struggle with calibration of P(IK) on new
tasks. The predicted P(IK) probabilities also increase appropriately in the
presence of relevant source materials in the context, and in the presence of
hints towards the solution of mathematical word problems. We hope these
observations lay the groundwork for training more honest models, and for
investigating how honesty generalizes to cases where models are trained on
objectives other than the imitation of human writing.
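A sketch of the P(True) self-evaluation recipe described above; the prompt mirrors the paper's general True/False format, while logprob_of_continuation is a hypothetical model API introduced here for illustration:

import math

def p_true(model, question: str, proposed_answer: str) -> float:
    """Ask the model to grade its own proposed answer as True or False."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed Answer: {proposed_answer}\n"
        "Is the proposed answer:\n"
        " (A) True\n"
        " (B) False\n"
        "The proposed answer is:"
    )
    # Compare the log-probabilities the model assigns to the two options.
    lp_true = model.logprob_of_continuation(prompt, " (A)")
    lp_false = model.logprob_of_continuation(prompt, " (B)")
    return math.exp(lp_true) / (math.exp(lp_true) + math.exp(lp_false))

The "consider many of their own samples" variant would add other sampled answers to the prompt before asking about one specific possibility.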
Specific versus General Principles for Constitutional AI
Human feedback can prevent overtly harmful utterances in conversational
models, but may not automatically mitigate subtle problematic behaviors such as
a stated desire for self-preservation or power. Constitutional AI offers an
alternative, replacing human feedback with feedback from AI models conditioned
only on a list of written principles. We find this approach effectively
prevents the expression of such behaviors. The success of simple principles
motivates us to ask: can models learn general ethical behaviors from only a
single written principle? To test this, we run experiments using a principle
roughly stated as "do what's best for humanity". We find that the largest
dialogue models can generalize from this short constitution, resulting in
harmless assistants with no stated interest in specific motivations like power.
A general principle may thus partially avoid the need for a long list of
constitutions targeting potentially harmful behaviors. However, more detailed
constitutions still improve fine-grained control over specific types of harms.
This suggests both general and specific principles have value for steering AI
safely.
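A minimal sketch of the AI-feedback comparison underlying Constitutional AI, using the single general principle discussed above; choose_option is a hypothetical preference-model call, not an interface from the paper:

PRINCIPLE = "Do what's best for humanity."

def ai_preference(feedback_model, conversation: str,
                  response_a: str, response_b: str) -> str:
    """Ask a feedback model which response better follows the principle."""
    comparison = (
        f"Consider the following conversation:\n{conversation}\n\n"
        f"Principle: {PRINCIPLE}\n\n"
        "Which response better follows the principle?\n"
        f" (A) {response_a}\n"
        f" (B) {response_b}\n"
        "The better response is:"
    )
    # The returned "A"/"B" labels become preference data for RLAIF training.
    return feedback_model.choose_option(comparison, options=["A", "B"])

Swapping PRINCIPLE for a long list of targeted principles is how the fine-grained control described above would be recovered.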
Scaling Laws and Interpretability of Learning from Repeated Data
Recent large language models have been trained on vast datasets, but also
often on repeated data, either intentionally for the purpose of upweighting
higher quality data, or unintentionally because data deduplication is not
perfect and the model is exposed to repeated data at the sentence, paragraph,
or document level. Some works have reported substantial negative performance
effects of this repeated data. In this paper we attempt to study repeated data
systematically and to understand its effects mechanistically. To do this, we
train a family of models where most of the data is unique but a small fraction
of it is repeated many times. We find a strong double descent phenomenon, in
which repeated data can cause test loss to increase midway through training. A
predictable range of repetition frequencies leads to surprisingly severe
degradation in performance. For instance, an 800M parameter model can be
degraded to the performance of a 2x smaller model (400M params) by repeating
0.1% of the data 100 times, even though the other 90% of the training tokens
remain unique (the repeated 0.1%, seen 100 times, accounts for the remaining
~10% of tokens). We suspect there is a range in the middle where the data can
be memorized and doing so consumes a large fraction of the model's capacity,
and this may be where the peak of degradation occurs. Finally, we connect these
observations to recent mechanistic interpretability work - attempting to
reverse engineer the detailed computations performed by the model - by showing
that data repetition disproportionately damages copying and internal structures
associated with generalization, such as induction heads, providing a possible
mechanism for the shift from generalization to memorization. Taken together,
these results provide a hypothesis for why repeating a relatively small
fraction of data in large language models could lead to disproportionately
large harms to performance.
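A small sketch of how such a mixture can be constructed, together with the arithmetic behind the 800M example above; the names and generator interface are our own illustration, not the paper's setup:

import random

# Arithmetic for the example: repeating 0.1% of the data 100 times fills
# about 10% of the training budget, leaving the other 90% unique.
repeated_subset, repeat_count = 0.001, 100
repeated_tokens = repeated_subset * repeat_count   # 0.10
print(f"{repeated_tokens:.0%} repeated, {1 - repeated_tokens:.0%} unique")

def mix(unique_docs, repeated_pool, p_repeated=0.1):
    """Yield a stream in which a tiny fixed pool makes up ~p_repeated of it."""
    it = iter(unique_docs)
    while True:
        if random.random() < p_repeated:
            yield random.choice(repeated_pool)  # one of the few repeated docs
        else:
            try:
                yield next(it)                  # fresh, never-repeated data
            except StopIteration:
                return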
Overexpression of arginase alters circulating and tissue amino acids and guanidino compounds and affects neuromotor behavior in mice
Arginine is an intermediate of the ornithine cycle and serves as a precursor for the synthesis of nitric oxide, creatine, agmatine and proteins. It is considered a conditionally essential amino acid because endogenous synthesis only barely meets daily requirements. In rapidly growing suckling neonates, endogenous arginine biosynthesis is crucial to compensate for the insufficient supply of arginine via the milk. Evidence is accumulating that the intestine, rather than the kidney, plays a major role in arginine synthesis during this period. Accordingly, ectopic expression of hepatic arginase in murine enterocytes by genetic modification induces a selective arginine deficiency. The ensuing phenotype, whose severity correlates with the level of transgene expression in the enterocytes, could be reversed with arginine supplementation. We analyzed the effect of arginine deficiency on guanidino compound metabolism and neuromotor behavior. Arginine-deficient transgenic mice continued to suffer from an arginine deficiency after the arginine biosynthetic enzymes had disappeared from the enterocytes. Postweaning catch-up growth in arginine-deficient mice was characterized by increased levels of all measured amino acids except arginine. Furthermore, total plasma amino acid concentration, including arginine, was significantly lower in adult male than in adult female transgenic mice. Decreases in plasma and tissue arginine concentrations led to significant decreases in most metabolites of arginine. However, the accumulation of the toxic guanidino compounds guanidinosuccinic acid and methylguanidine varied inversely with circulating arginine concentration, possibly reflecting higher oxidative stress under hypoargininemic conditions. In addition, hypoargininemia was associated with disturbed neuromotor behavior, although brain levels of toxic guanidino compounds and ammonia were normal.